Davies-Bouldin Index

Two characteristics are desired in a cluster: the variance of the data points assigned to it have low variance; the distance to other clusters is big. I'll call the first intra-cluster index and the second inter-cluster index.

Intra-Cluster Index

The intra-cluster index is essentially a measure of the variance of the data assigned to a cluster. The lower this value is the better score we have.As such, the intra-cluster index of cluster $i$ is given by

$$S_i = \frac{1}{T_i} \sum_{j=1}^{T_i} \left \| X_j - A_i \right \|_p$$

where $T_i$ is the number of data points assigned to cluster $i$, $X_j$ are the data points assigned to cluster $i$, $A_i$ is the centroid of the cluster $i$, and $p$ is the type of norm used.

Inter-Cluster Index

The inter-cluster index is a measure of how scattered are the clusters relative to each other. This is index is always measured between two clusters. The value of this index $M_{ij}$ between cluter $i$ and $j$ is:

$$M_{ij} = \left \| A_i - A_j \right \|_p$$

where $A_i$ and $A_j$ are the centroids of the $i$ and $j$ clusters, respectevely, and p is the type of norm being used.

Total index

The final index is a combination of both inter- and intra-cluster scores, for all clusters. The total score between any two clusters is:

$$R_{ij}=\frac{S_i + S_j}{M_{ij}}$$

The lower both intra-cluster scores are and the higher the inter-cluster score is, the smaller the total score will be. A low total score between two clusters represents two good clusters.

However, we still have to combine the scores from all the clusters. We do this by computing $R_{ij}$ between all clusters and selecting the highest for each. It's a pessimistic approach, since we're selecting the worse score associated with each cluster. As such, the worst score for each cluster $i$ is:

$$\max_{j:i \neq j} R_{ij}$$

The final David-Bouldin score is just the average of the worst case scenario scores:

$$DB=\frac{1}{K} \sum_{i=1}^{K} D_i$$

where $K$ is the total number of clusters. The lower DB is, the better evaluated are the clusters.


In [ ]: